Towards an Optimal Space-and-Query-Time Index for Top-k Document Retrieval
Abstract
Let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ be a given set of $D$ string documents of total length $n$. Our task is to index $\mathcal{D}$ such that the $k$ most relevant documents for an online query pattern $P$ of length $p$ can be retrieved efficiently. We propose an index of size $|CSA| + n \log D\,(2 + o(1))$ bits with $O(t_s(p) + k \log\log n + \operatorname{poly}\log\log n)$ query time for the basic relevance metric term-frequency, where $|CSA|$ is the size (in bits) of a compressed full-text index of $\mathcal{D}$ that searches for a pattern of length $p$ in $O(t_s(p))$ time. We further reduce the space to $|CSA| + n \log D\,(1 + o(1))$ bits, at the cost of a query time of $O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon} + \operatorname{poly}\log\log n)$, where $\sigma$ is the alphabet size and $\epsilon > 0$ is any constant.
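To make the problem statement concrete, the following is a minimal brute-force sketch of top-k retrieval under the term-frequency metric. It is not the compressed index proposed in the abstract: it scans every document on each query, so it takes time proportional to the total length n rather than the stated query bounds. The function names (`count_occurrences`, `top_k_by_term_frequency`), the counting of overlapping occurrences, and the tie-breaking by smaller document id are assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_term_frequency(documents, pattern, k):
    """Return up to k (doc_id, term_frequency) pairs, highest frequency first.

    Naive baseline: scans all documents per query. Ties are broken by the
    smaller document id (an assumption of this sketch).
    """
    scored = []
    for doc_id, doc in enumerate(documents):
        tf = count_occurrences(doc, pattern)
        if tf > 0:  # documents without the pattern are not reported
            scored.append((tf, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, tf) for tf, doc_id in best]


if __name__ == "__main__":
    docs = ["banana", "ananas", "bandana", "cab"]
    print(top_k_by_term_frequency(docs, "ana", 2))
```

A baseline like this is mainly useful as a correctness reference when testing an actual index, since its output is easy to verify by hand on small collections.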
Similar Resources
Improved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
Top-k document retrieval in optimal space
We present an index for top-k most frequent document retrieval whose space is $|CSA| + o(n) + D \log\frac{n}{D} + O(D)$ bits, and its query time is $O(\log k \log^{2+\epsilon} n)$ per reported document, where $D$ is the number of documents, $n$ is the sum of lengths of the documents, and $|CSA|$ is the space of the compressed suffix array for the documents. This improves over previous results for this problem, whose space complex...
Forbidden Extension Queries
Document retrieval is one of the most fundamental problems in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem, such as ranked document retrieval, document listing with two patterns, and forbidden patterns, have been studied. We introduce the problem of document retrieval with forbi...
Improved Single-Term Top-k Document Retrieval
On natural language text collections, finding the k documents most relevant to a query is generally solved with inverted indexes. On general string collections, however, more sophisticated data structures are necessary. Navarro and Nekrich [SODA 2012] showed that a linear-space index can solve such top-k queries in optimal time O(m + k), where m is the query length. Konow and Navarro [DCC 2013]...